A Multi-Level Boundary Classification Approach to Information Extraction
Abstract
Information Extraction (IE) is the process of identifying a set of pre-defined relevant items in text documents. We investigate the application of machine learning classification techniques to IE. In particular, we use Support Vector Machines and several different feature sets to build a set of classifiers for IE. We show that this approach is competitive with current state-of-the-art IE algorithms based on specialized learning algorithms. We investigate the different components of our IE system, such as the learning algorithm, feature set and instance selection, and compare how much each component contributes to performance. We also introduce a new multi-level classification technique for improving the recall of IE systems, and show that it can significantly improve the performance of our system, yielding both high precision and high recall. Our system, ELIE, is an adaptive IE algorithm that uses a two-level boundary classification approach to learning. ELIE first classifies every document position as the start of a fragment to be extracted, the end of a fragment, or neither. This first level of extraction typically has high precision but mediocre recall. To increase recall, we employ a second level of classification: positions near those extracted at the first level are classified by a second pair of classifiers that are biased for high recall. For example, the positions “downstream” from each extracted start position are classified in order to find the end of the given fragment. Our results on several benchmark corpora indicate that ELIE often outperforms state-of-the-art competitors.
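The two-level scheme described in the abstract can be illustrated with a minimal Python sketch. It is not ELIE's actual implementation: the function `two_level_extract`, the scorer callables, the `window` size, both thresholds, and the "nearest end" pairing rule are hypothetical choices made for illustration, and the toy scorers stand in for trained Support Vector Machines. Searching upstream of a first-level end for a missing start is assumed here by symmetry with the downstream example in the abstract.

```python
from typing import Callable, List, Tuple

# A scorer maps (tokens, position) to a confidence in [0, 1].
# In a real system this would wrap a trained classifier (assumption).
Scorer = Callable[[List[str], int], float]


def two_level_extract(
    tokens: List[str],
    start_l1: Scorer, end_l1: Scorer,   # level 1: high-precision classifiers
    start_l2: Scorer, end_l2: Scorer,   # level 2: high-recall classifiers
    window: int = 5,                    # hypothetical search window
    l1_threshold: float = 0.5,          # strict threshold (assumed value)
    l2_threshold: float = 0.1,          # lenient threshold (assumed value)
) -> List[Tuple[int, int]]:
    n = len(tokens)

    # Level 1: classify every position as start, end, or neither.
    starts = {i for i in range(n) if start_l1(tokens, i) >= l1_threshold}
    ends = {i for i in range(n) if end_l1(tokens, i) >= l1_threshold}

    # Level 2: only positions near a level-1 prediction are re-examined.
    # Downstream of each level-1 start, look for a missed end.
    l2_ends = {
        j for s in starts for j in range(s, min(n, s + window))
        if end_l2(tokens, j) >= l2_threshold
    }
    # Upstream of each level-1 end, look for a missed start.
    l2_starts = {
        j for e in ends for j in range(max(0, e - window + 1), e + 1)
        if start_l2(tokens, j) >= l2_threshold
    }

    all_starts = starts | l2_starts
    all_ends = ends | l2_ends

    # Pair each start with the nearest end inside the window (assumed rule).
    fragments = []
    for s in sorted(all_starts):
        candidates = [e for e in all_ends if s <= e < s + window]
        if candidates:
            fragments.append((s, min(candidates)))
    return fragments


# Toy scorers for demonstration only.
def toy_start_l1(toks: List[str], i: int) -> float:
    return 1.0 if i > 0 and toks[i - 1] == ":" else 0.0


def toy_never(toks: List[str], i: int) -> float:
    return 0.0


def toy_end_l2(toks: List[str], i: int) -> float:
    nxt = toks[i + 1] if i + 1 < len(toks) else ""
    return 0.3 if toks[i][:1].isupper() and nxt.islower() else 0.0


tokens = "Speaker : Dr. Jane Smith will talk at 3 pm".split()
fragments = two_level_extract(
    tokens, toy_start_l1, toy_never, toy_never, toy_end_l2
)
for s, e in fragments:
    print(" ".join(tokens[s:e + 1]))
```

In this toy run the first level finds a start but no end, so it would extract nothing on its own; the lenient second-level end classifier, applied only downstream of the detected start, completes the fragment. Restricting the second level to the neighbourhood of first-level predictions is what lets it be biased toward recall without flooding the whole document with false positives.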
Similar resources
Feature extraction of hyperspectral images using boundary semi-labeled samples and hybrid criterion
Feature extraction is a very important preprocessing step for classification of hyperspectral images. The linear discriminant analysis (LDA) method fails to work in small sample size situations. Moreover, LDA has poor efficiency for non-Gaussian data. LDA is optimized by a global criterion. Thus, it is not sufficiently flexible to cope with the multi-modal distributed data. We propose a new fea...
Multi-level Boundary Classification for Information Extraction
We investigate the application of classification techniques to the problem of information extraction (IE). In particular we use support vector machines and several different feature-sets to build a set of classifiers for IE. We show that this approach is competitive with current state-of-the-art IE algorithms based on specialized learning algorithms. We also introduce a new technique for improv...
EEG Based Brain Computer Interface Hand Grasp Control: Feature Extraction Method MTCSP
Brain-Computer Interfaces (BCIs) are communication systems, which enable users to send commands to computers by using brain activity only; this activity being generally measured by Electroencephalography (EEG). BCIs are generally designed according to a pattern recognition approach, i.e., by extracting features from EEG signals, and by using a classifier to identify the user’s mental state from...
Urban Vegetation Recognition Based on the Decision Level Fusion of Hyperspectral and Lidar Data
Introduction: Information about vegetation cover and its health has always been of interest to ecologists because of its importance in terms of habitat, energy production and other important characteristics of plants on Earth. Nowadays, developments in remote sensing technologies have made more remotely sensed data accessible to researchers. The combination of these data improves the obje...
Publication date: 2006